docs: add deployment guidance for llm fine-tuning examples #1740

Open

gmartini2000 wants to merge 1 commit into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Giulio Martini <martinigiulio02@gmail.com>
akoumpa (Contributor) reviewed on Apr 27, 2026:

A few inline suggestions on the new README.
> @@ -0,0 +1,64 @@
> # LLM Fine-Tuning Examples
>
> This directory contains NeMo AutoModel LLM fine-tuning recipes organized by model family. Each subdirectory provides YAML configs for a specific family, such as Llama, Mistral, Qwen, Gemma, Nemotron, and others. The main AutoModel README identifies `examples/llm_finetune/` as the location for LLM fine-tune configs and shows these recipes being launched through the `automodel` CLI.
akoumpa (Contributor) suggested:

```diff
- This directory contains NeMo AutoModel LLM fine-tuning recipes organized by model family. Each subdirectory provides YAML configs for a specific family, such as Llama, Mistral, Qwen, Gemma, Nemotron, and others. The main AutoModel README identifies `examples/llm_finetune/` as the location for LLM fine-tune configs and shows these recipes being launched through the `automodel` CLI.
+ This directory holds YAML recipes for fine-tuning LLMs with NeMo AutoModel. Each recipe pairs a config (the YAML) with a recipe class (here, `TrainFinetuneRecipeForNextTokenPrediction`); you launch it with the `automodel` CLI.
+
+ Pick your path:
+
+ | Goal                     | Recipe variant                         | Launch                                |
+ | ------------------------ | -------------------------------------- | ------------------------------------- |
+ | Full SFT, single node    | `<family>/<model>_<dataset>.yaml`      | `automodel <yaml> --nproc-per-node N` |
+ | LoRA / PEFT, single node | `<family>/<model>_<dataset>_peft.yaml` | same as above                         |
+ | Multi-node on SLURM      | any of the above                       | `sbatch` (see *Multi-Node Launches*)  |
+
+ Subdirectories group recipes by model family (Llama, Mistral, Qwen, Gemma, Nemotron, …).
```
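The suggested table implies a recipe naming scheme. As a minimal sketch of that convention, here is a hypothetical helper (not part of the repository) that splits a recipe path into family and variant; the exact split between "model" and "dataset" is an assumption, since both parts may contain underscores, so only the `_peft` suffix is peeled off:

```python
import re

def describe_recipe(path):
    """Classify a recipe path of the form <family>/<model>_<dataset>[_peft].yaml.

    Hypothetical illustration of the naming convention; the real recipes are
    identified only by their YAML files, not by any such helper.
    """
    m = re.match(r"(?P<family>[^/]+)/(?P<stem>.+?)(?P<peft>_peft)?\.yaml$", path)
    if not m:
        raise ValueError(f"not a recipe path: {path}")
    return {
        "family": m.group("family"),
        "stem": m.group("stem"),
        "variant": "peft" if m.group("peft") else "full_sft",
    }

print(describe_recipe("llama3_2/llama3_2_1b_squad.yaml"))
print(describe_recipe("llama3_2/llama3_2_1b_squad_peft.yaml"))
```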
> ## Running a Recipe
>
> Set up the environment with `uv`, then launch a recipe with `automodel`:
akoumpa (Contributor) suggested:

```diff
- Set up the environment with `uv`, then launch a recipe with `automodel`:
+ Recipes are launched through the `automodel` CLI (or its short alias `am`) — both are console scripts wrapping [`nemo_automodel/cli/app.py`](../../nemo_automodel/cli/app.py). For full setup and CLI options, see the [main README](../../README.md#getting-started); for end-to-end examples, see the [LLM SFT](../../README.md#llm-supervised-fine-tuning-sft) and [PEFT](../../README.md#llm-parameter-efficient-fine-tuning-peft) sections. Full reference docs: [docs.nvidia.com/nemo/automodel](https://docs.nvidia.com/nemo/automodel/latest/index.html).
+
+ Set up the environment with `uv`, then run a recipe:
```
> ```bash
> automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml --nproc-per-node 8
> ```
>
> These commands follow the repository's documented setup and launch pattern.
akoumpa (Contributor) suggested:

```diff
- These commands follow the repository's documented setup and launch pattern.
```
Comment on lines +23 to +31:
> ## Important Note on `finetune.py`
>
> A legacy `finetune.py` entry point exists in this directory, but it is deprecated. The script emits a deprecation warning and explicitly instructs users to launch recipes with:
>
> ```bash
> automodel <config.yaml> [--nproc-per-node N]
> ```
>
> So new documentation in this directory should prefer `automodel` over `python finetune.py`. This is also consistent with the main README's documented usage. The inspected script loads a config, constructs `TrainFinetuneRecipeForNextTokenPrediction`, then runs `setup()` followed by `run_train_validation_loop()`, which confirms that these examples are training-entry recipes rather than deployment scripts.
akoumpa (Contributor) suggested:

```diff
- ## Important Note on `finetune.py`
-
- A legacy `finetune.py` entry point exists in this directory, but it is deprecated. The script emits a deprecation warning and explicitly instructs users to launch recipes with:
-
- ```bash
- automodel <config.yaml> [--nproc-per-node N]
- ```
-
- So new documentation in this directory should prefer `automodel` over `python finetune.py`. This is also consistent with the main README's documented usage. The inspected script loads a config, constructs `TrainFinetuneRecipeForNextTokenPrediction`, then runs `setup()` followed by `run_train_validation_loop()`, which confirms that these examples are training-entry recipes rather than deployment scripts.
+ > [!NOTE]
+ > A legacy `finetune.py` still exists in this directory but is deprecated — it prints a `DeprecationWarning` and tells you to use `automodel` instead. Do not write new docs or examples around it.
```
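For readers unfamiliar with the deprecation pattern being discussed: the review states that `finetune.py` warns and redirects users to the CLI. A hedged sketch of that pattern (this is not the actual `finetune.py`; the function name and message wording here are illustrative only):

```python
import warnings

def deprecated_main():
    # Emit a DeprecationWarning pointing users at the supported launcher,
    # as the review describes finetune.py doing.
    warnings.warn(
        "finetune.py is deprecated; launch recipes with "
        "`automodel <config.yaml> [--nproc-per-node N]` instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Per the review, the real script then loads the YAML config, builds
    # TrainFinetuneRecipeForNextTokenPrediction, and calls setup() followed
    # by run_train_validation_loop(); those steps are omitted here.

# Demonstrate that the warning is raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    deprecated_main()

print(caught[0].category.__name__)  # DeprecationWarning
```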
Comment on lines +38 to +39:
> cp slurm.sub my_cluster.sub
> sbatch my_cluster.sub
akoumpa (Contributor): slurm.sub is at the repo root, not in this directory.

Suggested change:

```diff
- cp slurm.sub my_cluster.sub
- sbatch my_cluster.sub
+ cp ../../slurm.sub my_cluster.sub   # slurm.sub lives at the repo root
+ # edit my_cluster.sub: --nodes, --partition, container image, mounts, recipe path
+ sbatch my_cluster.sub
```
> sbatch my_cluster.sub
> ```
>
> Cluster-specific settings such as nodes, GPUs, partition, container, and mounts should be defined in the sbatch script. NeMo-Run sections are also supported through the cluster guide.
akoumpa (Contributor) suggested:

```diff
- Cluster-specific settings such as nodes, GPUs, partition, container, and mounts should be defined in the sbatch script. NeMo-Run sections are also supported through the cluster guide.
+ Cluster-specific settings (`--nodes`, `--gpus`, `--partition`, container image, mounts, recipe path) live in the sbatch script. For the NeMo-Run launcher, see [`docs/launcher/slurm.md`](../../docs/launcher/slurm.md).
```
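To make the list of cluster-specific settings concrete, here is a hypothetical sbatch template. Every value below (node counts, partition name, container image, mount paths) is a placeholder to adapt for your site, not a setting taken from the repository's actual `slurm.sub`, and the `srun`/container invocation is an assumption about a typical SLURM container workflow:

```shell
#!/bin/bash
# Hypothetical sbatch template -- all values are placeholders, not taken
# from the repository's slurm.sub.
#SBATCH --job-name=llm-finetune
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --partition=your_partition

# Container image, mounts, and the exact launch wrapper are cluster-specific;
# consult your site's SLURM documentation and the repo's slurm.sub for the
# authoritative pattern.
srun --container-image=your_image.sqsh \
     --container-mounts=/path/to/repo:/workspace \
     automodel /workspace/examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
```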
Comment on lines +48 to +54:
> ## Deployment Guidance
>
> This examples directory does not currently document a single canonical deployment command for all fine-tuned LLM recipes. Based on the materials reviewed here, the safest documented guidance is:
>
> 1. **Use the generated checkpoints in your follow-up evaluation or inference workflow.**
> 2. **Use AutoModel's documented container workflow** when you want a reproducible GPU-backed environment. The contributing guide documents both the AutoModel container path and a custom Docker build path.
> 3. **Refer to the broader NeMo and AutoModel documentation for production deployment architecture**, rather than assuming a serving/export API directly from these training examples. The repository positions AutoModel as part of the broader NeMo ecosystem for scalable training and deployment-oriented environments.
akoumpa (Contributor) suggested:

```diff
- ## Deployment Guidance
-
- This examples directory does not currently document a single canonical deployment command for all fine-tuned LLM recipes. Based on the materials reviewed here, the safest documented guidance is:
-
- 1. **Use the generated checkpoints in your follow-up evaluation or inference workflow.**
- 2. **Use AutoModel's documented container workflow** when you want a reproducible GPU-backed environment. The contributing guide documents both the AutoModel container path and a custom Docker build path.
- 3. **Refer to the broader NeMo and AutoModel documentation for production deployment architecture**, rather than assuming a serving/export API directly from these training examples. The repository positions AutoModel as part of the broader NeMo ecosystem for scalable training and deployment-oriented environments.
+ ## Deployment
+
+ These examples are training recipes; this directory does not own a deployment path. See the [main README](../../README.md) and the [NeMo AutoModel docs](https://docs.nvidia.com/nemo/automodel/latest/index.html) for serving and export guidance.
```
Comment on lines +56 to +64:
> ## Development Notes
>
> If you update documentation here, the contributing guide points contributors to the documentation development guide and requires signed-off commits:
>
> ```bash
> git commit -s -m "docs: add llm finetune README"
> ```
>
> Unsigned commits are not accepted.
akoumpa (Contributor) suggested removing the section:

```diff
- ## Development Notes
-
- If you update documentation here, the contributing guide points contributors to the documentation development guide and requires signed-off commits:
-
- ```bash
- git commit -s -m "docs: add llm finetune README"
- ```
-
- Unsigned commits are not accepted.
```
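On the sign-off requirement itself: `git commit -s` appends a `Signed-off-by: Name <email>` trailer to the commit message (this PR's commit carries one). A small illustrative check, with a hypothetical helper name, for what such a trailer looks like:

```python
def has_signoff(commit_message):
    """Return True if the commit message ends with a Signed-off-by trailer.

    Hypothetical helper for illustration; CI systems typically enforce this
    via a DCO check rather than custom code like this.
    """
    lines = [line.strip() for line in commit_message.strip().splitlines()]
    return any(line.startswith("Signed-off-by:") for line in lines)

# Trailer format as produced by `git commit -s`:
msg = """docs: add llm finetune README

Signed-off-by: Giulio Martini <martinigiulio02@gmail.com>"""

print(has_signoff(msg))  # True
```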
akoumpa (Contributor):

Hi @gmartini2000, thanks for making the doc, and I apologize for the delayed response. I think this is a good doc to include as a README for the recipes folder. I've added some suggestions; please let me know what you think. Thank you.
What does this PR do?

Adds a README to the LLM fine-tuning examples directory that provides guidance on how to run recipes, clarifies the deprecation of `finetune.py`, and outlines next steps after training, including basic deployment direction.

Changelog

- Adds `examples/llm_finetune/README.md`, documenting recipe launches through the `automodel` CLI and noting that `finetune.py` is deprecated

Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information